WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs
Similar resources
Blocking and non-blocking coordinated checkpointing for large scale MPI computation
Nowadays, clusters and grids are made of more and more computing nodes. Multi-process applications are most often programmed through message passing. The increase in the number of processes implies that these applications need to use a fault-tolerant message passing library. In this paper, we present two implementations of fault-tolerant protocols based on MPICH, a blocki...
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols
A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPIs has led to the development of several fault tolerant MPI environments. Different approaches a...
C3: A System for Automating Application-Level Checkpointing of MPI Programs
Fault-tolerance is becoming necessary on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual application-level checkpointing by writi...
Application-level Checkpointing for OpenMP Programs
It is becoming important for long-running scientific applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically to disk, and when a failure occurs, the computation is restarted from the last saved state. One common way of doing this, called System-level Checkpointing (SLC), requires modifying the Op...
Application-Level Checkpointing Techniques for Parallel Programs
In its simplest form, checkpointing is the act of saving a program’s computation state in a form external to the running program, for example to a filesystem. The checkpoint files can then be used to resume computation upon failure of the original process(es), hopefully with minimal loss of computing work. A checkpoint can be taken using a variety of techniques in every l...
Journal
Journal title: IEICE Transactions on Information and Systems
Year: 2012
ISSN: 0916-8532,1745-1361
DOI: 10.1587/transinf.e95.d.786